1: Limited English Speaking
We decided to explore how English proficiency is related to health
insurance coverage. According to a study on the relationship between
English proficiency and healthcare access, “due to legal and policy
contexts governing access to resources, adults with LEP, who are less
likely to be US-born, are subjected to stricter health insurance
coverage eligibility requirements and exposed to greater complexity and
administrative burden in determining eligibility and acquiring and
maintaining health insurance coverage” (Ramirez et al,
2023). Based on these previous findings, we predicted a negative
correlation between English proficiency and private health insurance
coverage.
In this dataset, the percentage of limited English-speaking
households per county is encoded as the variable
“ACS_PCT_HH_LIMT_ENGLISH.” The summary() function provides a basic sense
of the distribution of this variable.
#Summary statistics for % of limited English-speaking households
summary(sdoh$ACS_PCT_HH_LIMIT_ENGLISH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.290 0.870 3.469 2.160 90.340
Since the mean (3.5%) is above the median (0.87%), this indicates
that the distribution is right-skewed. This means that most counties
have a relatively low percentage of limited English-speaking households,
and counties with a very high percentage of limited English-speaking
households are considered outliers.
This graph clearly depicts the right-skewed distribution of this
variable demonstrated by the summary statistics. It is clear that the
percentage of limited English-speaking households is below 25% for the
majority of the counties in this dataset.
Next, two methods were used to evaluate the relationship between
limited English-speaking and health insurance coverage: a visual method
(by graph) and a quantitative method (by calculating the correlation
coefficient). For the graph, we used the function geom_point(), which
creates a scatter plot. Within this function, the color, size, and
opacity of the data being plotted was also customized. Finally, x and y
axis labels, as well as a title, was added using the xlab(), ylab(), and
ggtitle() functions.
#Correlation between limited English-speaking and private insurance coverage
ggplot(sdoh, aes(x = ACS_PCT_HH_LIMIT_ENGLISH, y = ACS_PCT_PRIVATE_ANY)) +
geom_point(color = 'blue',alpha = 0.5, size = 2) +
geom_smooth(method = "lm", color = "black", se = FALSE) +
xlab("% of limited English-speaking households") +
ylab("% of population with private insurance")+
ggtitle("Association between English-speaking and private insurance ownership")+
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This graph illustrates a negative correlation between the percentage
of individuals from limited English-speaking households and the
percentage of individuals with private insurance coverage by
county.
The last piece of analysis for evaluating the relationship between
limited English-speaking and insurance coverage was calculating the
correlation coefficient. This provides a quantitative measure of the
relationship that was illustrated on the scatterplot. R has a built-in
function called “cor()” that automatically calculates the degree of
association between two variables.
##Calculate correlation coefficient
cor(sdoh$ACS_PCT_HH_LIMIT_ENGLISH, sdoh$ACS_PCT_PRIVATE_ANY)
## [1] -0.3865378
The correlation coefficient was calculated as –0.39, which indicates
a moderate negative association between limited English-speaking and
insurance coverage. In other words, as the percentage of limited
English-speaking households in a county increase, the percentage of
residents with healthcare insurance coverage decreases.
The association between the percentage of individuals from limited
English-speaking households and the percentage of individuals with
private insurance coverage reflected in the graph and by the correlation
coefficient is consistent with our prediction. Further investigation is
necessary to understand the causal relationship between these variables,
and how this disparity in health insurance might negatively affect
healthcare outcomes for limited English-speaking populations.
2: Various Education Levels
In this particular dataset, the percentage of people in each county
with a designated education level is encoded through the following
variables: “ACS_PCT_LT_HS” (less than high school education),
“ACS_PCT_HS_GRADUATE” (only high school diploma), “ACS_PCT_BACHELOR_DGR”
(holds a bachelor’s degree), and “ACS_PCT_GRADUATE_DGR” (holds a
graduate or professional degree).
To begin analysis and visualize the distribution of each educational
variable, the ggplot() function used with geom_histogram produces a set
of histogram distribution plots which show the distribution of the
percentages of the population within each educational level by county in
the dataset. The pivot_longer() function was used to put all the
educational levels in one column. We named this new combined variable
“sdoh_long”, which was used consistently throughout the project when
creating a new combined variable. Next, we renamed the previous race to
respective educational levels and created a histogram plot using the
ggplot() function.
#Visualize distribution of education level in US by county
#Convert data to long format to put into a facet wrapped plot
sdoh_long <- sdoh %>%
pivot_longer(cols = c(ACS_PCT_LT_HS, ACS_PCT_HS_GRADUATE, ACS_PCT_BACHELOR_DGR, ACS_PCT_GRADUATE_DGR),
names_to = "Education_Level",
values_to = "Percentage")
#Rename education levels for better readability/understanding
sdoh_long$Education_Level <- factor(sdoh_long$Education_Level,
levels = c("ACS_PCT_LT_HS", "ACS_PCT_HS_GRADUATE", "ACS_PCT_BACHELOR_DGR", "ACS_PCT_GRADUATE_DGR"),
labels = c("Less than High School", "High School Graduate", "Bachelor's Degree", "Graduate Degree"))
#Create the faceted histogram plot with all four plots in one figure
ggplot(sdoh_long, aes(x = Percentage)) +
geom_histogram(color = "grey", fill = "light blue", bins = 30) +
facet_wrap(~Education_Level, nrow = 2) +
xlab("% of county population") +
ylab("Number of counties") +
ggtitle("Distribution of Education Levels by County in the US") +
theme_minimal()

By looking at this information, the mean is above the median for
populations with less than a high school diploma, a bachelor’s degree,
and a graduate or professional degree, indicating that each of the
distributions is slightly right skewed. This means that counties with
high percentages of these population groups are considered outliers. For
those with only a high school diploma, the median is greater than the
mean which indicates a slight left skew and that counties with lower
percentages of these population groups are considered outliers.
The next step was then to analyze the relationship between each of
these educational levels and private health insurance ownership. The
following ggplot() functions used with geom_point are responsible for
creating correlation plots for the relationship between each of the
education level variables and private insurance ownership. The ggplot()
functions are responsible for creating the plot, with the xlab(),
ylab(), and ggtitle() functions being used to describe the axes names
and title for the plot.
#Plotting all correlation association plots together on the same plot for visualization
# Convert data to long format to put all plots together
sdoh_long <- sdoh %>%
pivot_longer(cols = c(ACS_PCT_LT_HS, ACS_PCT_HS_GRADUATE, ACS_PCT_BACHELOR_DGR, ACS_PCT_GRADUATE_DGR),
names_to = "Education_Level",
values_to = "Education_Percentage")
# Rename education levels for better readability/understanding
sdoh_long$Education_Level <- factor(sdoh_long$Education_Level,
levels = c("ACS_PCT_LT_HS", "ACS_PCT_HS_GRADUATE", "ACS_PCT_BACHELOR_DGR", "ACS_PCT_GRADUATE_DGR"),
labels = c("Less than High School", "High School Graduate", "Bachelor's Degree", "Graduate Degree"))
# Create the faceted scatter plot with trend lines for each relationship
ggplot(sdoh_long, aes(x = Education_Percentage, y = ACS_PCT_PRIVATE_ANY)) +
geom_point(aes(color = Education_Level), alpha = 0.5, size = 2) +
geom_smooth(method = "lm", color = "black", se = FALSE) +
facet_wrap(~Education_Level, nrow = 2) +
scale_color_manual(values = c("green", "blue", "red", "purple")) +
xlab("% of Population with Specified Education Level") +
ylab("% of Population with Private Insurance") +
ggtitle("Association between Education and Private Insurance Ownership") +
theme_minimal() +
theme(legend.position = "none")
## `geom_smooth()` using formula = 'y ~ x'

Based on the visual depictions seen in the correlation plots, higher
percentages of the population being made up of those with less than a
high school education and those with only a high school diploma is
associated with a decreased percentage of the population owning private
health insurance. Higher percentages of the population having a
bachelor’s, graduate, or professional degree are associated with an
increased percentage of the population owning private health insurance.
Each plot also contains a trendline to depict the general relationship
between education and private health insurance, which appears to be that
higher education levels are associated with higher levels private health
insurance ownership.
Finally, the next evaluation was to assess the correlation between
education and private health insurance ownership through a correlation
coefficient value. Using the cor() and cat() functions together also
prints the correlation coefficients for each of these
relationships.
#Print all correlation coefficients between education level and private insurance rates at once
{
cat("\n % of Population with Less than High School Education\n")
print(cor(sdoh$ACS_PCT_LT_HS, sdoh$ACS_PCT_PRIVATE_ANY))
cat("\n % of Population with Only High School Diploma\n")
print(cor(sdoh$ACS_PCT_HS_GRADUATE, sdoh$ACS_PCT_PRIVATE_ANY))
cat("\n % of Population with a Bachelor's Degree\n")
print(cor(sdoh$ACS_PCT_BACHELOR_DGR, sdoh$ACS_PCT_PRIVATE_ANY))
cat("\n % of Population with a Graduate or Professional Degree\n")
print(cor(sdoh$ACS_PCT_GRADUATE_DGR, sdoh$ACS_PCT_PRIVATE_ANY))
}
##
## % of Population with Less than High School Education
## [1] -0.6949437
##
## % of Population with Only High School Diploma
## [1] -0.2915207
##
## % of Population with a Bachelor's Degree
## [1] 0.5637536
##
## % of Population with a Graduate or Professional Degree
## [1] 0.4332939
As shown in output the correlation coefficients for the variables
including those with less or only a high school diploma have negative
coefficients (-0.69 and –0.29 respectively), whereas those with a
bachelor’s, graduate, or professional degree have positive coefficients
(0.56 and 0.43 respectively). These findings indicate that our initial
prediction of higher education being associated with increased rates of
private health insurance ownership is potentially correct. Further
investigation is needed to assess further specifics of this relationship
and how to potentially minimize this disparity.
3. Various Racial Groups
Another factor that shapes private insurance status is an
individual’s racial background. We wanted to understand the relationship
between private insurance coverage and race. According to a study
conducted by the NIH, “blacks, Hispanics, and some Asian populations,
when compared with whites, appear to have lower levels of health
insurance coverage” (National Research Council (US)
Panel on Race, Ethnicity, and Health in Later Life, 2004).
In this dataset, the percentage of each racial group is encoded as
the variables “ACS_PCT_WHITE” representing White, “ACS_PCT_BLACK”
representing Black, “ACS_PCT_ASIAN” representing Asian, “ACS_PCT_AIAN”
representing American Indian and Alaska Native and finally
“ACS_PCT_MULT_RACE” representing multiple races.
To begin, the distribution of racial groups across county was
determined. The pivot_longer() function was used to put all the races in
one column. We named this new combined variable “sdoh_long.”
Next, we renamed the previous race to reflect actual races: White,
Black, Asian, American Indian/Alaska Native, and Multiracial. Then we
created a histogram plot using the ggplot() function. The x-axis
represents the percentage of the county’s population belonging to a
racial group, while the y-axis shows the number of counties. Within this
function, the color, size, and opacity of the data being plotted can
also be customized. Finally, x and y axis labels, as well as a title,
were added using the xlab(), ylab(), and ggtitle() functions.
#visualization of each racial group in percentage for counties
#Convert data to long format
sdoh_long <- sdoh %>%
pivot_longer(cols = c(ACS_PCT_WHITE, ACS_PCT_BLACK, ACS_PCT_ASIAN, ACS_PCT_AIAN, ACS_PCT_MULT_RACE),
names_to = "Race_Ethnicity",
values_to = "Percentage")
#Rename categories for better readability
sdoh_long$Race_Ethnicity <- factor(sdoh_long$Race_Ethnicity,
levels = c("ACS_PCT_WHITE", "ACS_PCT_BLACK", "ACS_PCT_ASIAN", "ACS_PCT_AIAN", "ACS_PCT_MULT_RACE"),
labels = c("White", "Black", "Asian", "American Indian/Alaska Native", "Multiracial"))
#Create a single faceted histogram plot
ggplot(sdoh_long, aes(x = Percentage)) +
geom_histogram(color = "grey", fill = "lightblue", bins = 30) +
facet_wrap(~Race_Ethnicity, nrow = 2) + # Arrange plots in a 2-row layout
xlab("% of County Population") +
ylab("Number of Counties") +
ggtitle("Distribution of Racial/Ethnic Groups by County in the US") +
theme_minimal()

Based on the histogram distribution, it is observed that each race
has a skew to its data. Looking at the white population, it has a left
skew, showing most of the data falls around 100% of the population for a
lot of counties, which indicates they are the predominate race. This is
quite the opposite for all the other racial groups, which have a right
skew showing their percentage is much closer to 0, which shows they are
the minorities and have a lower representation
Looking at the summary statistics for each race and looking at the
median, it is seen that the most represented race in the data set is
white-identifying individuals. Having a median of 87.66% in comparison
to the least represented population, American Indian/Alaska Native,
having a median of 0.320%. Looking at the % population distribution of
Whites, it is seen that they have a mean lower than the median, which
indicates a left skew, which was seen in the histogram above. In
addition, looking at the other races, they all had a mean higher than
the median, indicating a right skew. We chose to look specifically at
the median because it is resistant to outliers.
Next, we wanted to see the relationship between private insurance
coverage and racial groups in the top 5 counties with the highest
private insurance coverage rates. To achieve this, we used
arrange(desc()) and slice(1:5) to identify these counties. The dataset
was then transformed into a long format using pivot_longer(), which
allows for easier visualization. An interactive bar chart was created
using geom_bar() and ggplotly(), displaying the percentage of each
racial group in these top counties.
#Selecting the racial variables, county and private insurance
selected_racial_group_with_county <- sdoh %>%
select(COUNTY, ACS_PCT_WHITE, ACS_PCT_BLACK, ACS_PCT_ASIAN, ACS_PCT_AIAN, ACS_PCT_MULT_RACE, ACS_PCT_PRIVATE_ANY)
#Identify top 5 counties with highest private insurance coverage
top_counties <- selected_racial_group_with_county %>%
arrange(desc(ACS_PCT_PRIVATE_ANY)) %>%
slice(1:5)
top_counties
## COUNTY ACS_PCT_WHITE ACS_PCT_BLACK ACS_PCT_ASIAN ACS_PCT_AIAN
## 1 Kalawao County 13.30 0.69 3.44 17.66
## 2 Falls Church city 78.01 4.82 9.88 0.09
## 3 Daggett County 97.63 0.17 0.00 0.34
## 4 Los Alamos County 84.15 0.81 4.92 0.94
## 5 Burke County 93.42 0.75 1.54 1.07
## ACS_PCT_MULT_RACE ACS_PCT_PRIVATE_ANY
## 1 2.75 98.38
## 2 4.11 87.85
## 3 0.00 85.71
## 4 7.05 85.30
## 5 2.85 83.41
#Convert race columns into long format for visualization
selected_long <- top_counties %>%
pivot_longer(cols = c(ACS_PCT_WHITE, ACS_PCT_BLACK, ACS_PCT_ASIAN, ACS_PCT_AIAN, ACS_PCT_MULT_RACE),
names_to = "Race",
values_to = "Percentage")
#Creating an interactive bar plot
p <- ggplot(selected_long, aes(x = Race, y = Percentage, fill = Race,
text = paste("County:", COUNTY,
"<br>Private Insurance:", round(ACS_PCT_PRIVATE_ANY, 2)))) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ COUNTY, scales = "free_y") + # Facet by County only
labs(title = "Race Distribution in Top 5 Counties with Highest Private Insurance",
x = NULL, # Removes x-axis label to avoid repetition
y = "Percentage of Population") +
theme_minimal() +
theme(axis.text.x = element_blank(), # Removes race labels at the bottom
axis.ticks.x = element_blank(),
strip.text = element_text(size = 10, face = "bold"), # Adjust facet title size
legend.position = "bottom") # Moves legend to bottom
#Convert to interactive plotly graph
interactive_plot <- ggplotly(p, tooltip = "text")
interactive_plot
The interactive plot follows the pattern established previously
about the distribution of each race; the most represented are Whites.
However, in Kalawao County, we see that the race with the highest
representation is AIAN, which is an interesting departure from the
observed trend. Also, in comparison to the other counties the other
races like Black, Asian and Multirace individuals, had a higher
representation in Kalawao County.
Lastly, we wanted to get the actual correlation between each racial
group and private insurance coverage. We used the cor() function to
calculate the correlation coefficient for each racial group. The cat()
function assists in printing all four summary outputs at once. The
numbers range from –1 to 1. The closer to 1, the higher the correlation
of the 2 variables. If the value is negative, it indicates that as one
variable increases, the other decreases. However, if you have a positive
value, as one increases, so does the other variable. The racial group
with the highest correlation with private insurance having a moderate
positive correlation was White, having a value of 0.42.
#Print correlation coefficients for every association between race and private insurance at once
{
cat("\nCorrelation: % White vs. Private Insurance\n")
print(cor(sdoh$ACS_PCT_WHITE, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))
cat("\nCorrelation: % Black vs. Private Insurance\n")
print(cor(sdoh$ACS_PCT_BLACK, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))
cat("\nCorrelation: % Asian vs. Private Insurance\n")
print(cor(sdoh$ACS_PCT_ASIAN, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))
cat("\nCorrelation: % American Indian/Alaska Native vs. Private Insurance\n")
print(cor(sdoh$ACS_PCT_AIAN, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))
cat("\nCorrelation: % Multiracial vs. Private Insurance\n")
print(cor(sdoh$ACS_PCT_MULT_RACE, sdoh$ACS_PCT_PRIVATE_ANY, use = "complete.obs"))
}
##
## Correlation: % White vs. Private Insurance
## [1] 0.4260394
##
## Correlation: % Black vs. Private Insurance
## [1] -0.2949309
##
## Correlation: % Asian vs. Private Insurance
## [1] 0.2303914
##
## Correlation: % American Indian/Alaska Native vs. Private Insurance
## [1] -0.2617146
##
## Correlation: % Multiracial vs. Private Insurance
## [1] -0.1810702
Understanding the correlation between racial groups and private
insurance will help us improve healthcare equity and access. We had a
prediction of race and private insurance coverage having a strong
correlation, with Whites having the strong positive correlation and the
other races having a negative strong correlation. However, our findings
proved otherwise; we saw a moderate positive correlation between White
and private insurance coverage and Asians having a low positive
correlation (0.23). Blacks and American Indian/Alaska Native both had a
negative low correlation (-0.29 and 0.261, respectively). This might be
because certain racial groups are more likely to have public insurance
coverage.
1. Social Context = demographics, socioeconomic status